1 Summary

During the contract period our main results are a computer code implementing fast parallel
algorithms for particle systems interacting through long-range forces, an analysis of the error
characteristics of the chosen method, and a parallel implementation of an O(N log^2 N) algorithm
for Legendre and spherical transforms. We have also derived an algebraic framework for describing
permutations frequently used in scientific computation. The framework allows for a rigorous
analysis of the communication requirements of parallel algorithms and is also very useful in
address computations during compilation or in run-time systems. For efficient data motion,
or remote references, we have also further validated the potential benefits of ROMM routing.
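As an illustration only, many of the permutations named later in this report (bit-reversal,
transposition) can be written as permutations of the bits of an array address. The sketch
below assumes that representation; the function names and the bit-map convention are
hypothetical and are not taken from the framework itself.

```python
def permute_address_bits(index, bit_map):
    """Move bit i of `index` to position bit_map[i] (illustrative convention)."""
    out = 0
    for src, dst in enumerate(bit_map):
        out |= ((index >> src) & 1) << dst
    return out


def bit_reversal_map(n_bits):
    """Bit i goes to position n_bits - 1 - i."""
    return [n_bits - 1 - i for i in range(n_bits)]


def transpose_map(n_bits):
    """For a square 2^(n/2) x 2^(n/2) array stored row-major (n_bits even):
    swap the row bits (high half) with the column bits (low half)."""
    half = n_bits // 2
    return [(i + half) % n_bits for i in range(n_bits)]


# Element (row 1, column 2) of a 4 x 4 array sits at address 6;
# after transposition it sits at (row 2, column 1), address 9.
new_address = permute_address_bits(6, transpose_map(4))
```

Because the whole permutation is described by a short bit map, the destination address of any
element can be computed locally, which is the kind of address computation a compiler or
run-time system needs.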

Fast  N-body  algorithms 

In the first phase of our project we showed that Anderson’s method [2], based on Poisson’s
formula, can be efficiently implemented on scalable parallel architectures [10, 11] in a
high-level language such as Connection Machine Fortran (CMF) [21]. High Performance
Fortran (HPF) [7] has adopted many of the features of CMF. We simulated particle systems
with up to 100 million particles on CM-5 and CM-5E systems. The code was listed as an
“impressive entry” in the 1994 Gordon Bell competition.

The parallel implementation allowed us to develop an improved understanding of the trade-off
between accuracy and computational effort for Anderson’s hierarchical method. For three digits
of accuracy the method is competitive with direct N-body solvers at about 8,700 particles for
three-dimensional problems, while for six digits of accuracy the break-even point is at about
38,000 particles. These empirical results are based on comparisons with the direct O(N^2)
method for systems with up to 1 million particles.
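For reference, the direct method used as the baseline evaluates every pairwise interaction.
The sketch below is a minimal illustration, not the code developed under the contract; it
assumes unit physical constants and a 1/r potential, and the function names are hypothetical.

```python
import math


def pair_count(n):
    """Number of distinct particle pairs the direct method must evaluate."""
    return n * (n - 1) // 2


def direct_potentials(positions, charges):
    """Potential at each particle due to all others, by direct summation.

    positions: list of (x, y, z) tuples; charges: list of floats.
    Each pair is visited once, so the cost grows as O(N^2).
    """
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            phi[i] += charges[j] / r
            phi[j] += charges[i] / r
    return phi
```

A hierarchical method has a larger cost per particle, which depends on the requested accuracy,
but its cost grows only linearly in N. That is why the break-even particle count moves upward
when more digits are requested.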

Hierarchical algorithms are the only alternative for simulation of large scale particle systems
with long range forces, whether gravitational or Coulombic. For astrophysical systems and for
certain molecular dynamics simulations, large scale may mean hundreds of millions of
particles. Proposed methods for large particle systems include the O(N log N) method of
Barnes and Hut [3], further developed by Salmon et al. [17, 18], and O(N) methods by
Rokhlin and Greengard [8] (multi-pole expansions), Anderson [2] (Poisson’s formula), and
Brandt [4] (multi-grid).

Anderson’s method, like the multi-pole and multi-grid methods, is an approximate technique.
For all methods higher accuracy can be achieved at increased computational effort. Though
the computational time is directly proportional to the number of particles, the constant of
proportionality depends upon the desired accuracy. Rokhlin and Greengard have given error
bounds as a function of the number of terms in the multi-pole expansion, but accurate error
estimates, i.e., how many terms are needed for a given desired accuracy and how this depends
upon the particle distribution, are not well understood. Similarly, Anderson provides some
insights into the error characteristics of his method, but no rigorous tight bounds.

In our empirical study we explored how the accuracy varies with the parameters of the
method: the radii of the spheres of integration, the number of terms in the expansion, the
order of the integration method (number of integration points), the depth of the near-field,
and the depth of the hierarchy. We also examined the effect of the distribution of particles
and the impact of using supernodes [11, 24]. Our studies showed that for the evaluation of
Coulombic forces in three dimensions, using near-fields of a depth equal to one box and a
hierarchy depth that minimizes the number of floating-point operations gives the best running
time for any desired accuracy. Supernodes are not applicable for near-fields with a depth of
a single box.

We also studied how small variations of particle distributions affect the accuracy of
the method. We focused on varying the distribution of particles within leaf-level boxes while
keeping the number of particles per leaf-level box constant. For settings that give six digits
of accuracy with the particles clustered in a corner of each leaf-level box, close to one
additional digit of accuracy was achieved when the particles were instead clustered at the
center of the leaf-level box. For unsymmetrical distributions, we found that enlarging the
outer spheres of integration, which intuitively should smooth the field to be integrated, did
not improve the accuracy.

A report summarizing the above results is in preparation [9]. Implementation techniques
and benchmark data are presented in [10, 11].

Fast transforms


We are in the process of devising, implementing and analyzing parallel versions of polynomial
Cosine Transforms and a fast Legendre Transform (FLT) recently discovered by Driscoll and
Healy [5]. The Legendre Transform is the basic component of spherical harmonics, which are
used extensively in many scientific applications, especially meteorology and environmental
sciences. The FLT computes the Legendre Transform in O(N log^2 N) operations compared
to O(N^2) operations for the classical approach. Aside from round-off errors, the
Driscoll-Healy FLT is exact, as opposed to the O(N log N) FLT proposed by Alpert and Rokhlin
[1], which is based on the multi-pole expansion technique.
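To make the O(N^2) baseline concrete, the sketch below evaluates a Legendre series directly
at a set of points, using the standard three-term recurrence for the Legendre polynomials.
It illustrates the classical approach only, not the FLT; the function names are hypothetical.

```python
def legendre_values(n_terms, x):
    """Return [P_0(x), ..., P_{n_terms-1}(x)] via the three-term recurrence
    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)."""
    p = [1.0, x][:n_terms]
    for k in range(1, n_terms - 1):
        p.append(((2 * k + 1) * x * p[k] - k * p[k - 1]) / (k + 1))
    return p


def legendre_synthesis(coeffs, points):
    """Evaluate f(x_j) = sum_k coeffs[k] * P_k(x_j) for every point x_j.

    With N coefficients and N points this costs O(N^2) operations.
    """
    n = len(coeffs)
    return [
        sum(c * p for c, p in zip(coeffs, legendre_values(n, x)))
        for x in points
    ]
```

Each output value needs a sum over all N coefficients, and there are N output values; a fast
transform has to avoid forming these N^2 products explicitly.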

The novel modification of the Driscoll-Healy algorithm made in the course of this work is
the replacement of convolutions with Cosine Transforms. This has been joint work
with Maslen [12]. Since the Discrete Cosine Transform (DCT) is the key building block in
our parallel FLT, most of the effort has been dedicated to comparing the Classical DCT
[15, 16, 20], which makes use of the Fast Fourier Transform as the main computational
structure, and the Polynomial DCT [19]. Variations of both algorithms have been compared
from the perspective of parallel arithmetic, memory and communication complexity, as well
as load-balance. A report is in preparation.
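One common way to obtain a DCT from an FFT is sketched below: the input is zero-padded to
twice its length, transformed, and each output is rotated by a phase factor. This is a
minimal illustration of the general idea, assuming an unnormalized DCT-II and a power-of-two
length; it is not necessarily the variant studied in [15, 16, 20], and the function names are
hypothetical.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dct2_direct(x):
    """Unnormalized DCT-II by its definition, O(N^2)."""
    n = len(x)
    return [
        sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n))
        for k in range(n)
    ]


def dct2_via_fft(x):
    """Same DCT-II computed through one FFT of length 2N, O(N log N)."""
    n = len(x)
    spectrum = fft(list(x) + [0.0] * n)  # zero-pad to length 2N
    return [
        (cmath.exp(-1j * cmath.pi * k / (2 * n)) * spectrum[k]).real
        for k in range(n)
    ]
```

The two routines agree up to round-off, which makes `dct2_direct` a convenient check for any
faster variant.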


Efficient  multi-processor  communication 


We  have  demonstrated  that  a  new  routing  technique,  ROMM  routing,  is  only  marginally 
less  efficient  than  Dimension-order  routing  when  such  routing  is  optimal,  and  two  to  four 
times  faster  when  Dimension-order  routing  performs  poorly.  Similarly,  ROMM  routing  is 
two  to  four  times  faster  than  fully  randomized  routing  for  many  interesting  permutations, 
and  only  marginally  slower  when  fully  randomized  routing  performs  well.  The  technique  and 
simulation  results  are  documented  in  two  conference  papers  [13,  14]. 

Three approaches to communication in multi-processor systems are libraries, special
compilers, and general routers, each with its advantages and disadvantages.
General purpose routers are clearly necessary in general purpose computing environments,
but their performance is often not sufficiently good for critical applications. For such
applications communications libraries are typically used. However, such libraries are quite
expensive to produce for production environments in which machine sizes and data array
sizes and shapes may vary considerably. With an average development cost in the range
$100,000 - $150,000 per library function, very few can be developed for each system, both
from a cost point of view and a time-to-market point of view. Compiled routing can
considerably reduce the cost and development time for library functions, or replace library
functions altogether. A communication compiler was developed for the Connection Machine
CM-2/200. This compiler did not generate code sufficiently efficient for common communication
patterns to eliminate the need for hand-optimized code, but produced sufficiently efficient
code to warrant its use for many irregular communications instead of the general router. A
communication compiler was also developed at CMU for the iWarp [6], for which it was used
successfully in a restricted setting. One fundamental difficulty with compiled routing is its
limited ability to efficiently handle communication patterns not known until run-time. For
arbitrary communication patterns, and in particular for patterns that are not known until
run-time, efficient general routers are highly desirable.

We have devised and explored a new routing technique, ROMM (Randomized Oblivious
Minimal Multi-phase) routing. The idea is to attempt to combine the best properties of
minimal routing and randomized routing. Minimal algorithms are optimal for a number
of important routing patterns on some networks (such as CSHIFT on meshes and binary
cubes, bit-complement permutations and random permutations). However, the most popular
minimal algorithm, Dimension-order routing, performs very poorly on permutations such
as transposition and bit-reversal. Fully randomized algorithms, such as Valiant-Brebner
routing [22, 23], may perform better than Dimension-order routing for permutations such
as transpose and bit-reversal (asymptotically, fully randomized algorithms are guaranteed to
perform better), but usually perform relatively poorly on permutations where
Dimension-order routing performs well.
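The contrast between a shift and a transposition under Dimension-order routing can be seen
in a small model. The sketch below assumes a two-dimensional mesh without wraparound links,
corrects the x coordinate first and then the y coordinate, and counts how many packets cross
each directed link when every node sends one packet. It is a toy congestion count, not the
simulator used in this work, and all names are hypothetical.

```python
from collections import Counter


def dimension_order_path(src, dst):
    """Nodes visited from src to dst: correct x first, then y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def max_link_load(side, permutation):
    """Largest number of packets sharing one directed link on a side x side mesh."""
    load = Counter()
    for x in range(side):
        for y in range(side):
            path = dimension_order_path((x, y), permutation((x, y)))
            load.update(zip(path, path[1:]))
    return max(load.values(), default=0)


def transpose(node):
    return (node[1], node[0])


def make_shift(side):
    """Circular shift by one in x; the wrapped packet travels back across the row."""
    return lambda node: ((node[0] + 1) % side, node[1])


shift_load = max_link_load(4, make_shift(4))
transpose_load = max_link_load(4, transpose)
```

On a 4 x 4 mesh this model gives a maximum link load of 1 for the shift and 3 for the
transposition: every packet from a given row turns at the same diagonal node, so the links
next to the diagonal are shared.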

ROMM  provides  a  mechanism  for  controlling  the  amount  of  randomization  introduced  and 
limiting  the  resources  required  for  deadlock  freedom.  The  technique  is  straightforward 
and  may  be  used  for  general-purpose  routing  algorithms  in  networks  which  use  store-and- 
forward,  virtual  cut-through,  or  wormhole  routing. 
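The sketch below illustrates one way to limit randomization while staying on minimal paths.
It assumes a two-phase scheme on a two-dimensional mesh in which the random intermediate node
is drawn from the bounding box spanned by source and destination; that choice, like the
function names, is an assumption made for illustration and is not a statement of the
algorithms defined in [13, 14]. A fully randomized route, which draws the intermediate node
from the whole mesh, is included for comparison.

```python
import random


def dimension_order_path(src, dst):
    """Nodes visited from src to dst: correct x first, then y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def two_phase_minimal_path(src, dst, rng):
    """Route via a random node inside the src/dst bounding box (path stays minimal)."""
    mid = (
        rng.randint(min(src[0], dst[0]), max(src[0], dst[0])),
        rng.randint(min(src[1], dst[1]), max(src[1], dst[1])),
    )
    return dimension_order_path(src, mid) + dimension_order_path(mid, dst)[1:]


def fully_randomized_path(src, dst, side, rng):
    """Route via a random node anywhere in the side x side mesh (may be non-minimal)."""
    mid = (rng.randrange(side), rng.randrange(side))
    return dimension_order_path(src, mid) + dimension_order_path(mid, dst)[1:]


rng = random.Random(0)  # fixed seed so the example is repeatable
path = two_phase_minimal_path((0, 0), (3, 2), rng)
hops = len(path) - 1    # always 5, the Manhattan distance from (0, 0) to (3, 2)
```

Because the intermediate node lies inside the bounding box, the two phases together never
exceed the minimal hop count, while different packets between the same pair of nodes can
still be spread over different links.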

Using the ROMM framework, we have developed a method for defining and analyzing
algorithms in the class, and defined ROMM algorithms for mesh, torus, and binary cube
networks. Analytical results show that these algorithms have the potential to outperform
deterministic, oblivious routing algorithms and fully-randomized routing algorithms for a
variety of representative routing tasks. Using a parallel simulator, we have shown that for
wormhole-routed mesh, torus, and binary cube networks with up to 1024 nodes, ROMM
algorithms are competitive with Dimension-order routing and, in some cases, more than two
times faster on two-dimensional square meshes and faster still on binary cube networks.
Our results also show that ROMM algorithms are up to three times
faster than Valiant-Brebner routing for many routing problems and that ROMM routing
scales well to larger network sizes.

1995 - present Chair, Chair Search Committee, Department of Computer Science,
University of Houston

1995  -  present  Campus  Computing  Working  Group,  University  of  Houston 

1993 Serge G. Petiton, L’Habilitation a Diriger des Recherches,
Contribution a une Methodologie Globale Pour Le Calcul
Scientifique Parallele, University of Paris VI.


6  Honors  and  Awards: 

“Impressive  Entry”  recognition  in  the  1994  Gordon  Bell  Prize  contest  (with  Yu  Hu). 


7  Journal  Publications 

“Local  Basic  Linear  Algebra  Subroutines  (LBLAS)  for  the  CM-5/5E”,  (with  David  Kramer 
and  Yu  Hu),  to  appear  in  the  International  Journal  of  Supercomputer  Applications,  vol.  , 
no.  ,  pp.  ,  1996. 

“A Data Parallel Implementation of Hierarchical N-body Methods”, (with Yu Hu), to appear
in the International Journal of Supercomputer Applications, vol. , no. , pp. , 1996.

“Implementing O(N) N-body algorithms efficiently in data parallel languages”, (with Yu
Hu), to appear in the Journal of Scientific Programming, vol. , no. , pp. , 1996.

“All-to-All Communication on the Connection Machine system CM-200”, (with Kapil K.
Mathur), the Journal of Scientific Programming, vol. 4, no. 4, pp. 251 - 273, 1995.

“On the Conversion between Binary Code and Binary Reflected Gray Code”, (with Ching-Tien
Ho), in IEEE Transactions on Computers, vol. 44, no. 1, pp. 47 - 53, January 1995.

“Index Transformation Algorithms in a Linear Algebra Framework”, (with Alan Edelman
and Steve Heller), in Transactions on Parallel and Distributed Systems, vol. 5, no. 12, pp.
1302 - 1309, 1994.

“Scalability of Finite Element Applications on Distributed-Memory Parallel Computers”,
(with Zdenek Johan and Kapil K. Mathur and S. Lennart Johnsson and Thomas J.R.
Hughes), in Computer Methods in Applied Mechanics and Engineering, vol. 119, nos. 1
- 2, pp. 61 - 72, November 1994.

“Issues in High Performance Computer Networks”, in IEEE Technical Committee on Computer
Architecture Newsletter, Summer - Fall 1994, pp. 14 - 19.

“Optimal Communication Channel Utilization for Matrix Transposition and Related
Permutations on Boolean Cubes”, (with Ching-Tien Ho) in the Journal of Discrete Applied
Mathematics, vol. 53, pp. 251 - 274, September 1994.

“Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer”, (with Kapil
K. Mathur), in Journal of Parallel Computing, vol. 20, no. 7, pp. 919 - 951, July 1994.

“An Efficient Communication Strategy for Finite Element Methods on the Connection
Machine CM-5 System”, (with Zdenek Johan, Kapil K Mathur, and Thomas J.R. Hughes), in
Computer Methods in Applied Mechanics and Engineering, vol. 113, pages 363 - 387, 1994.

“POLYSHIFT  Communications  Software  for  the  Connection  Machine  System  CM-200”, 
(with  Ralph  Brickner  and  William  George),  Journal  of  Scientific  Programming,  vol.  3,  no. 
1,  pp.  83  -  99,  Spring  1994. 

“Boolean Cube Emulation of Butterfly Networks Encoded by Gray Code” (with Ching-Tien
Ho), Journal of Parallel and Distributed Computing, vol. 20, no. 3, pp. 261 - 279, 1994.

“An Efficient Algorithm for Gray-to-Binary Permutation on Hypercubes”, (with Ching-Tien
Ho and M.T. Raghunath), Journal of Parallel and Distributed Computing, vol. 20, no.
1, pp. 114 - 120, 1994.

“Embedding Hyper-pyramids in Hypercubes”, (with Ching-Tien Ho), IBM Journal of Research
and Development, vol. 38, no. 1, pp. 31 - 45, 1994.

“Minimizing the Communication Time for Matrix Multiplication on Multiprocessors”, Journal
of Parallel Computing, vol. 19, no. 11, pp. 1235 - 1257, 1993.

“Block  Cyclic  Dense  Linear  Algebra”,  (with  Woody  Lichtenstein),  SIAM  J.  of  Sci.  Comp., 
vol.  14,  no.  6,  pp.  1257  -  1286,  1993. 


8  Invited  Presentations 


1995 

“Data Partitioning for Load-Balance and Communication Bandwidth Preservation”, The
Second International Conference on Massively Parallel Processing and Optical
Interconnections, San Antonio, Texas, October 23 - 24, 1995.

“Structured Linear Algebra Software on Scalable Architectures”, International Congress on
Industrial and Applied Mathematics, Hamburg, Germany, July 3 - 7, 1995.

“On the Accuracy of Fast N-body Algorithms”, AFOSR PI-meeting, Phillips Laboratory,
Kirtland Air Force Base, Albuquerque, New Mexico, June 28 - 30, 1995.

“A Stencil Compiler for the Connection Machine Model CM-5”, 5th Workshop on Compilers
for Parallel Computers, Malaga, Spain, June 28 - 30, 1995.

“Implementing O(N) N-body Algorithms Efficiently in Data Parallel Languages (High
Performance Fortran)”, Los Alamos National Laboratories, Los Alamos, New Mexico, June 15,
1995.

“On  the  Error  in  Anderson’s  Fast  N-body  Algorithm”,  The  Royal  Institute  of  Technology, 
May  30,  1995,  Stockholm,  Sweden. 

“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
Michigan State University, East Lansing, March 16 - 17, 1995.

“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
the Mardi Gras Conference on High Performance Computing Technologies, Baton Rouge,
Louisiana, February 23 - 25, 1995.

“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
the Institute for Computer Science, Linkoping University, Linkoping, Sweden, January 10,
1995.


1994 


“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
the Parallel Computation Center Annual Symposium, the Royal Institute of Technology,
Stockholm, Sweden, December 15 - 16, 1994.

“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
Northwestern University, Evanston, Illinois, December 7, 1994.

“ROMM Routing: A Class of Efficient Minimal Routing Algorithms”, Applied Mathematics
Seminar series, California Institute of Technology, Pasadena, California, December 1, 1994.

“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
Center for Research in Parallel Computation, California Institute of Technology, Pasadena,
California, November 30, 1994.

“Scientific  Supercomputing:  Making  MPP’s  deliver  on  their  performance”,  Computacion 
Cientifica  en  Paralelo,  Mexico  City,  Mexico,  October  27  -  28,  1994. 

“Implementing O(N) N-body algorithms efficiently in data parallel languages (High
Performance Fortran)”, DIMACS Third Annual Implementation Challenge Workshop, DIMACS,
Rutgers University, New Brunswick, New Jersey, October 17 - 18, 1994.

“Scientific Supercomputing: Making MPP’s deliver on their promise of high performance”,
ICASE NASA Langley Industry Roundtable, Williamsburg, Virginia, October 3 - 4, 1994.

“Parallel  Hierarchical  N-Body  Algorithms  for  Long  Range  Forces”,  AFOSR  Workshop  on 
Large  Scale  Simulations  in  Chemistry/Material  Science,  September  12  -  13,  Dayton,  Ohio, 
1994. 

“ROMM Routing: A Class of Efficient Minimal Routing Algorithms”, NEC Research Institute,
August 5, Princeton, New Jersey, 1994.

“Load-Balanced  LU  and  QR  Factor  and  Solve  Routines  for  Scalable  Processors  with  Scalable 
I/O”,  14th  IMACS  World  Congress,  Parallel  Linear  Algebra,  July  11  -  15,  Atlanta,  Georgia, 
1994. 

“High  Performance  Computing:  Scalable  Libraries,  Scalable  Applications”,  14th  IMACS 
World  Congress,  Parallel  Linear  Algebra,  July  11-15,  Atlanta,  Georgia,  1994. 

“Scalable  Scientific  Software  Libraries”,  Workshop  on  Parallel  Scientific  Computing,  UNI-C, 
Lyngby,  Denmark,  June  20  -  23,  1994. 

“Scientific Supercomputing: Making MPPs deliver on their promise of high performance”,
the University of Houston, Houston, Texas, May 26, 1994.

“ROMM Routing: A Class of Efficient Minimal Routing Algorithms”, Parallel Computer
Routing and Communication Workshop, University of Washington, Seattle, Washington,
May 16 - 18, 1994.

“Data Motion in High Performance Computing”, First International Workshop on Massively
Parallel Processing Using Optical Interconnections, Cancun, Mexico, April 26 - 27, 1994.

“Scientific  Computation  on  Scalable  Architectures”,  TIMS  ORSA  Joint  National  Meeting, 
Boston,  Massachusetts,  April  24  -  27,  1994. 

“Data Parallel Finite Element Techniques for Compressible Flow Problems”, (with Zdenek
Johan, Kapil K. Mathur, and Thomas J.R. Hughes), Proceedings of the Parallel
Computational Fluid Dynamics 1994 Workshop, Tokyo, March 1994.

“Performance of the Connection Machine System CM-5”, ARPA High Performance Computing
and Communications Symposium, Alexandria, Virginia, March 15 - 18, 1994.

“Scientific Libraries on Scalable Architectures”, Conference on Teraflop Computing, Baton
Rouge, Louisiana, February 10 - 12, 1994.

“Locality in High Performance Parallel Computing”, DIMACS Workshop on Organizing and
Moving Data in Parallel Computers, Princeton, New Jersey, January 26 - 28, 1994.

“The  Connection  Machine  System  CM-5”,  the  University  of  Tennessee,  Tennessee,  January 
19,  1994. 


1993 

“Scientific Libraries on Scalable Architectures”, Workshop on Parallel Scientific
Computation, Stockholm, Sweden, December 15 - 17, 1993.

“A  Stencil  Compiler  for  the  Connection  Machine  Models  CM-2/200”,  Fourth  International 
Workshop  on  Compilers  for  Parallel  Computers,  Delft,  The  Netherlands,  December  13-16, 
1993. 

“Scientific Libraries on Scalable Architectures”, Cornell University, Ithaca, New York,
November 29, 1993.

“Scientific Libraries on Scalable Architectures”, University of Maryland, College Park,
Maryland, November 18, 1993.

“Scientific Libraries on Scalable Architectures”, Bellcore, Morristown, New Jersey, November
9, 1993.

“Scientific  Libraries  on  Scalable  Architectures”,  Los  Alamos  National  Laboratories,  Los 
Alamos,  New  Mexico,  October  29, 1993. 

“Scalability of Finite Element Applications on Distributed-Memory Parallel Computers”,
(with Zdenek Johan and Kapil K. Mathur and S. Lennart Johnsson and Thomas J.R.
Hughes), Presented at the Symposium on Parallel Finite Element Computations,
Minneapolis, Minnesota, October, 1993.

“Scientific  Libraries  on  Scalable  Architectures”,  CERN,  Geneva,  Switzerland,  October  14, 
1993. 

“The CMSSL”, Second European Connection Machine Users Group Conference, Paris, France,
October 13, 1993.

“Scientific Libraries on Scalable Architectures”, Scalable Parallel Libraries Conference,
Mississippi State University, Starkville, Mississippi, October 6 - 8, 1993.

“Scientific  Libraries  on  Scalable  Architectures”,  ARPA  HPCC  Semiannual  meeting,  San 
Diego,  California,  September  28  -  29,  1993. 

“Finite  Element  Techniques  for  Computational  Fluid  Dynamics  on  the  Connection  Machine 
CM-5  System”,  with  Z.  Johan,  K.K.  Mathur,  S.L.  Johnsson  and  T.J.R.  Hughes,  the  Second 
US  Congress  on  Computational  Mechanics,  Washington  D.C.,  August  1993. 

“Scientific  Libraries  on  Scalable  Architectures”,  Workshop  on  Portability  and  Performance 
for  Parallel  Processing,  Southampton,  Hampshire,  England,  July  13  -  15,  1993. 

“The  Connection  Machine  System  CM-5”,  SPAA-93,  Sport  Schloss  Velen,  Germany,  June 
30  -  July  2,  1993. 


9  Refereed  Conference  Papers  and  Book  Chapters 

“ROMM Routing on Mesh and Torus Networks”, (with Ted Nesson) Proceedings of the 7th
Annual ACM Symposium on Parallel Algorithms and Architectures, ACM Press, pages 275 -
287, 1995.

“Parallel Implementation of Recursive Spectral Bisection on the Connection Machine CM-5
System”, (with Zdenek Johan, Kapil K. Mathur and Thomas J.R. Hughes), Parallel
Computational Fluid Dynamics: New Trends and Advances, pages 451 - 459, Elsevier Science,
1995.

“ROMM Routing: A Class of Efficient Minimal Routing Algorithms”, (with Ted Nesson)
Proceedings of the Parallel Computer Routing and Communication Workshop, Springer-Verlag,
Lecture Notes in Computer Science 853, pages 185 - 199, 1994.

“Scientific Software Libraries for Scalable Architectures”, (with Kapil K. Mathur), in Parallel
Scientific Computing, Springer Verlag, 1994.

“Data Motion and High Performance Computing”, in Proceedings of the First International
Workshop on Massively Parallel Processing Using Optical Interconnections, pages 1 - 18,
IEEE Computer Society, Order no. 5832-02, ISBN 0-8186-5832-02, 1994.

“Mesh Decomposition and Communication Procedures for Finite Element Applications on
the Connection Machine CM-5 System”, (with Zdenek Johan, Kapil K. Mathur and Thomas
J.R. Hughes), in High-Performance Computing and Networking, vol. 2, pages 233 - 240,
Springer-Verlag, Lecture Notes in Computer Science, 1994.

“CMSSL: A Scalable Scientific Software Library”, in Proceedings of the Scalable Parallel
Libraries Conference, pages 57 - 66, IEEE Computer Society, Order no. 4980-02, ISBN
0-8186-4980-1, 1994.

“High Performance, Scalable Scientific Software Libraries”, (with Kapil K. Mathur)
Portability and Performance in Parallel Processing, pages 159 - 208, 1994, John Wiley & Sons.

“Massively Parallel Computing: Mathematics and Communications Libraries”, (with Kapil
K. Mathur), Parallel Supercomputing in Atmospheric Science, pages 250 - 285, 1993, World
Scientific.

“The Connection Machine System CM-5”, the Fourth Symposium on Parallel Algorithms
and Architectures, SPAA-93, pp. 365 - 366, 1993, ACM Press.

“Massively Parallel Computing: Unstructured Finite Element Simulations”, (with K. Mathur,
Zdenek Johan and Thomas J.R. Hughes), NAFEMS: Proceedings of the Fourth International
Conference on Quality Assurance and Standards in Finite Element and Associated
Technologies, NAFEMS, pp. 158 - 170, 1993.


10  Unrefereed  Conf.  Papers  and  Technical  reports 

“Structured  Linear  Algebra  Software  on  Scalable  Architectures”,  ICIAM95,  Hamburg,  July 
3-7,  page  54,  ICIAM  Book  of  Abstracts,  1995 

“Data Parallel Finite Element Techniques for Compressible Flow Problems”, (with Zdenek
Johan, Kapil K. Mathur, and Thomas J.R. Hughes), Proceedings of the Parallel
Computational Fluid Dynamics 1994 Workshop, March 1994. Harvard University Technical Report
TR-04-94, January 1994.

“Load-Balanced LU and QR Factor and Solve Routines for Scalable Processors with Scalable
I/O” (with Jean-Philippe Brunet and Palle Pedersen), in Proceedings of the 14th IMACS
World Congress, July 11 - 15, 1994, Atlanta, Georgia. Harvard University Technical Report
TR-20-94.

“A Stencil Compiler for the Connection Machine Models CM-2/200”, (with Ralph G.
Brickner, William George and Alan Ruttenberg), Fourth International Workshop on Compilers
for Parallel Computers, pages 68 - 78, Delft, 1993.

“Finite Element Techniques for Computational Fluid Dynamics on the Connection Machine
CM-5 System”, with Z. Johan, K.K. Mathur, S.L. Johnsson and T.J.R. Hughes, the Second
US Congress on Computational Mechanics, Washington D.C., August 1993.